1 Overview

The paaPack package performs hierarchical Principal Amalgamation Analysis (HPAA), with or without the guidance of a taxonomic tree structure, and provides several graphical tools for visualizing the results: 1) hierarchical dendrograms showing the full path of amalgamations, 2) a scree plot showing the percentage change in diversity loss as the number of compositions changes, and 3) an ordination plot showing the changes in between-sample distance patterns before and after HPAA for any given number of principal compositions (PCs). In this tutorial, we also provide an R Shiny app to dynamically visualize how the ordination plot changes along the path of HPAA, i.e., from the largest to the smallest number of PCs.

1.1 Set up

install.packages("../Rpackage/paaPack_0.0-1.tar.gz", repos = NULL, type="source")
library(paaPack)
#### functions in paaPack
help(hPAA)  #### fit HPAA models
help(plotHPAA)  #### dendrogram showing the hierarchical amalgamation
help(plotLine)  #### scree plot showing the percentage change in diversity loss
help(plotMDS)  #### ordination plot

The main function of paaPack, hPAA(), performs HPAA. To use it, the analyst should provide:

  • The compositional data, with each row holding the composition of one subject and each column corresponding to one of the original components/taxa.

  • The taxonomic tree structure, as a vector whose length equals the number of taxa. The format follows the output of the commonly used bioinformatics software Mothur: each element gives the full taxonomic ranks of a taxon, from kingdom down to the genus or species level, separated by semicolons. For example, a typical element could be k_Bacteria;p_Actinobacteria;c_Actinobacteria;o_Bifidobacteriales;f_Bifidobacteriaceae;g_Bifidobacterium;s_longum. The taxonomic structure is optional; if it is not provided, unconstrained HPAA without tree guidance is performed.

  • The diversity measure to be used in HPAA, and whether a strong or weak taxonomic hierarchy is applied in the analysis.
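As a rough illustration of the first two inputs, the sketch below builds a toy compositional matrix and a matching Mothur-style taxonomy vector using base R only. All values and the object names (counts, comp, taxonomy, ranks) are made up for illustration; the sketch does not call hPAA(), whose argument names should be checked with help(hPAA).

```r
# Toy inputs in the layout described above (all values are made up).
counts <- matrix(
  c(10, 5, 0, 5,
     2, 8, 6, 4,
     0, 3, 9, 8),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("subject", 1:3), paste0("taxon", 1:4))
)

# Convert counts to compositions: each row (subject) sums to 1.
comp <- counts / rowSums(counts)

# One Mothur-style taxonomy string per column of `comp`.
taxonomy <- c(
  "k_Bacteria;p_Actinobacteria;c_Actinobacteria;o_Bifidobacteriales;f_Bifidobacteriaceae;g_Bifidobacterium;s_longum",
  "k_Bacteria;p_Actinobacteria;c_Actinobacteria;o_Bifidobacteriales;f_Bifidobacteriaceae;g_Bifidobacterium;s_breve",
  "k_Bacteria;p_Firmicutes;c_Bacilli;o_Lactobacillales;f_Streptococcaceae;g_Streptococcus;s_salivarius",
  "k_Bacteria;p_Firmicutes;c_Bacilli;o_Lactobacillales;f_Enterococcaceae;g_Enterococcus;s_faecalis"
)
stopifnot(length(taxonomy) == ncol(comp))

# Split each string on ";" to get a taxa-by-rank character matrix.
ranks <- do.call(rbind, strsplit(taxonomy, ";", fixed = TRUE))
colnames(ranks) <- c("kingdom", "phylum", "class", "order",
                     "family", "genus", "species")
rownames(ranks) <- colnames(comp)
```

Splitting the strings this way is only a convenience for inspecting the ranks; according to the description above, the package itself takes the unsplit vector.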

A set of plotting methods then takes the object of class “hPAA” returned by hPAA() as input: plotHPAA() draws the dendrogram showing the full path of hierarchical amalgamation, plotLine() draws the scree plot showing the percentage change in diversity loss as the number of principal compositions changes, and plotMDS() draws the ordination plot showing the changes in between-sample distance patterns before and after HPAA. Each function accepts a group of graphical arguments for shaping the figure. For details, see the documentation of each function via help().

2 Illustration

In the following sections, we use the NICU data to illustrate the visualization tools provided by paaPack. The code for constructing all the figures is documented in the corresponding code chunks of the source R Markdown file. These tools help to visualize and understand compositional data, and to determine the desired number of principal compositions in practice.

2.1 Dendrogram

We construct an HPAA dendrogram with the function plotHPAA() to visualize simultaneously the tree diagram of the successive amalgamations and the taxonomic structure of the taxa. To illustrate, Figure 2.1 shows the HPAA dendrogram from performing HPAA with SDI loss and strong taxonomic hierarchy on the NICU data. The top part of the figure shows the dendrogram of amalgamations, where the \(y\)-axis shows the percentage decrease in total diversity as measured by SDI (on the log scale) along the successive amalgamations, from bottom to top. As such, any horizontal cut of the dendrogram at a desired level of diversity loss/preservation gives the corresponding amalgamated data. In particular, each red dashed horizontal line marks the step at which the original data are aggregated to a higher taxonomic rank. For example, aggregating the data to the order level (22 taxa or principal compositions left) through HPAA leads to a 22.3% loss in total SDI. In the bottom part, color bars show the taxonomic structure of the taxa: within each horizontal bar, taxa of the same color belong to the same category at that rank.

Figure 2.1: The NICU data: Dendrogram of HPAA with SDI and strong taxonomic hierarchy.
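To make the quantity on the \(y\)-axis concrete, the following sketch computes the diversity lost by a single amalgamation in a single sample. It assumes that SDI denotes Simpson's diversity index, 1 minus the sum of squared proportions, and that amalgamating two components means summing them; the helper names simpson() and amalgamate() are hypothetical and are not part of paaPack.

```r
# Simpson's diversity index of one composition p (entries sum to 1).
# Assumption: this is what "SDI" refers to; paaPack may define it differently.
simpson <- function(p) 1 - sum(p^2)

# Amalgamate components i and j by summing them into one component.
amalgamate <- function(p, i, j) c(p[-c(i, j)], p[i] + p[j])

p <- c(0.5, 0.25, 0.125, 0.125)   # a made-up composition
p_merged <- amalgamate(p, 3, 4)   # merge the two rarest components

sdi_before <- simpson(p)          # 0.65625
sdi_after  <- simpson(p_merged)   # 0.625
pct_loss   <- 100 * (sdi_before - sdi_after) / sdi_before
round(pct_loss, 2)                # 4.76
```

Under this definition, merging components i and j lowers the index by exactly 2 * p[i] * p[j], so merging rare components costs little diversity. The dendrogram reports the total over all samples rather than the single-sample value shown here.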

We then display, for each loss function, the results under the different forms of taxonomic guidance in one combined figure for direct comparison. Figures 2.2 and 2.3 show HPAA dendrograms with SDI loss and BC loss, respectively, under all three levels of taxonomic guidance. Not surprisingly, the patterns of amalgamation vary across settings. Without taxonomic constraints, the change in diversity is very smooth along the amalgamations, but the resulting principal compositions may not be easily interpretable, as indicated by the mixed color patterns in the color bars of the taxonomic ranks. Under strong taxonomic hierarchy, on the other hand, the principal compositions are forced to follow the taxonomic structure closely, but the percentage change in diversity tends to exhibit dramatic jumps, especially at the steps where the last remaining taxon at a lower taxonomic rank is forced to be aggregated to a higher rank. As a compromise, under weak taxonomic hierarchy the resulting principal compositions remain interpretable, and the percentage change in diversity remains smooth and can be quite close to that of the unconstrained setting in the early stages of amalgamation.

Figure 2.2: The NICU data: HPAA dendrograms with SDI and different constraints on taxonomic hierarchy.

Figure 2.3: The NICU data: HPAA dendrograms with Bray-Curtis and different constraints on taxonomic hierarchy.

2.2 Scree plot

Next, we use the function plotLine() to construct scree plots for the results of HPAA under different types of taxonomic guidance. The scree plot shows the percentage change in diversity loss as a function of the number of principal compositions. Figure 2.4 shows the scree plots from performing HPAA on the NICU data under different settings. The differences among the three levels of taxonomic guidance confirm the earlier observation from the dendrograms that weak taxonomic hierarchy strikes a good balance between preserving information and interpretability.

Figure 2.4: The NICU data: Scree plots for HPAA (Percentage change in diversity vs. number of principal compositions).
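The shape of such a scree curve can be reproduced on toy data with a simple greedy rule: at every step, merge the pair of components whose amalgamation loses the least total diversity. The sketch below is not the paaPack implementation. It assumes Simpson's index as the diversity measure, ignores any taxonomic constraint, and uses the hypothetical helpers total_sdi() and greedy_path() on simulated abundances.

```r
# Total Simpson diversity of a compositional matrix (rows = samples).
total_sdi <- function(x) sum(1 - rowSums(x^2))

# Greedy, unconstrained amalgamation: at each step merge the pair of
# columns whose amalgamation loses the least total diversity.
greedy_path <- function(x) {
  base <- total_sdi(x)
  out <- data.frame(n_comp = ncol(x), pct_loss = 0)
  while (ncol(x) > 1) {
    pairs <- combn(ncol(x), 2)
    # Merging columns i and j lowers total SDI by 2 * sum(x[, i] * x[, j]).
    loss <- apply(pairs, 2, function(ij) 2 * sum(x[, ij[1]] * x[, ij[2]]))
    best <- pairs[, which.min(loss)]
    merged <- x[, best[1]] + x[, best[2]]
    x <- cbind(x[, -best, drop = FALSE], merged)
    out <- rbind(out, data.frame(
      n_comp = ncol(x),
      pct_loss = 100 * (base - total_sdi(x)) / base
    ))
  }
  out
}

set.seed(1)
raw <- matrix(rexp(5 * 6), nrow = 5)   # made-up abundances: 5 samples, 6 taxa
comp <- raw / rowSums(raw)
path <- greedy_path(comp)

plot(path$n_comp, path$pct_loss, type = "b",
     xlab = "Number of principal compositions",
     ylab = "Percentage loss in total SDI")
```

The curve starts at 0% with all six taxa and reaches 100% when everything is merged into a single component. A taxonomic constraint would restrict which pairs are eligible at each step, which is one way to see why constrained paths can show larger jumps.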

2.3 Ordination plot

Finally, we use plotMDS() to construct an ordination plot that visualizes the changes in between-sample distance patterns before and after HPAA for any given number of principal compositions. Specifically, the function performs non-metric multidimensional scaling (NMDS) with Bray-Curtis dissimilarity on the combined original data and principal compositions from HPAA, which produces a low-dimensional ordination plot of all samples before and after amalgamation. Each sample is represented by a pair of points, one from the original data and one from the principal compositions; the smallest circle covering the pair is drawn, and its radius indicates the level of distortion due to the HPAA data reduction. Figure 2.5 shows the ordination plots from performing HPAA on the NICU data with three different loss functions and weak taxonomic hierarchy, keeping 20 principal compositions (the number of PCs can be changed via the corresponding parameter of plotMDS()). All three settings preserve the between-sample diversity reasonably well, as indicated by the generally small radii of the circles; as expected, HPAA with the BC loss performs best, as it directly targets the preservation of between-sample diversity.

Figure 2.5: The NICU data: 2D NMDS ordination plots for comparing original and principal compositions from HPAA with weak taxonomic hierarchy.
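A simpler numerical check of the same idea is to compare the Bray-Curtis dissimilarity matrices before and after amalgamation directly, without the NMDS step. The sketch below does this on simulated data; it is not what plotMDS() does internally, and the function bray_curtis(), the grouping vector groups, and the data are all assumptions made for illustration.

```r
# Bray-Curtis dissimilarity between all pairs of rows of x.
# For compositions (rows summing to 1) this equals half the L1 distance.
bray_curtis <- function(x) {
  n <- nrow(x)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) for (j in seq_len(n)) {
    d[i, j] <- sum(abs(x[i, ] - x[j, ])) / sum(x[i, ] + x[j, ])
  }
  d
}

set.seed(2)
raw <- matrix(rexp(6 * 8), nrow = 6)   # made-up abundances: 6 samples, 8 taxa
orig <- raw / rowSums(raw)

# A hypothetical amalgamation: taxa sharing a group label are summed.
groups <- c(1, 1, 2, 2, 3, 3, 4, 4)
amalg <- t(rowsum(t(orig), groups))

bc_orig  <- bray_curtis(orig)
bc_amalg <- bray_curtis(amalg)

# Distortion: how much each pairwise dissimilarity shrinks after amalgamation.
distortion <- bc_orig - bc_amalg
max(distortion)
```

Summing components can only shrink a Bray-Curtis dissimilarity, never increase it, so every entry of distortion is non-negative; small values correspond to the small circles in the ordination plot. In the plot itself, the smallest circle covering a pair of points has a radius equal to half the distance between them.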